Ensemble inference of unobserved infections in networks using partial observations

Undetected infections fuel the dissemination of many infectious agents. However, identification of unobserved infectious individuals remains challenging due to limited observations of infections and imperfect knowledge of key transmission parameters. Here, we use an ensemble Bayesian inference method to infer unobserved infections using partial observations. The ensemble inference method can represent uncertainty in model parameters and update model states using all ensemble members collectively. We perform extensive experiments in both model-generated and real-world networks in which individuals have differential but unknown transmission rates. The ensemble method outperforms several alternative approaches for a variety of network structures and observation rates, despite that the model is mis-specified. Additionally, the computational complexity of this algorithm scales almost linearly with the number of nodes in the network and the number of observations, respectively, exhibiting the potential to apply to large-scale networks. The inference method may support decision-making under uncertainty and be adapted for use for other dynamical models in networks.


Epidemic model
For an undirected network, each node is in one of three states: susceptible (S), infected (I) and recovered (R). A susceptible individual can be infected by an infected neighbor with a transmission rate on each day, independent of the states of other neighbors. An infected individual recovers with a probability 1/ , where is the average infectious period. In all experiments, we fixed = 7 days. To represent variations in transmission rates, was drawn from a bimodal distribution, created by superimposing two Gaussian distributions (Fig A). A power-law distribution for transmission rate ( ( ) ∝ − , = 2.5) was also tested in additional experiments.
A set of seeds were selected to initiate outbreaks in a fully susceptible population. Seeds, transmission rate distribution and simulation time were selected to produce outbreaks that infect 25-35% population by the end of simulation (i.e., the infected (I) and recovered (R) individuals account for 25-35% population on the last day). We simulated outbreaks for 7 days in networks with less than 5,000 nodes, and 14 days in larger networks that need more time to infect the target percentage of population. We chose this epidemic regime for a meaningful comparison of inference algorithmsif the percentage of infected population is very high (low), inference that estimates all (none) individuals are infected would produce an artificial high accuracy. In addition, this epidemic regime represents the early to middle stage of outbreaks, when accurate inference is useful for real-time disease control.
We modeled the observational process using daily testing probabilities for people in different states. On each day, each node is randomly tested with a probability depending on the state. We assume infected persons are more likely to be tested than susceptible and recovered persons. The states of tested individuals (S, I or R) are obtained on various testing dates. By varying daily testing probabilities, we can adjust the percentage of observed individuals. In the experiments, we simulated observations of around 15%, 30% and 50% of the total population. Detailed configurations of epidemic models, including distribution of transmission rate and testing probability, are provided in Table A.

Inference framework
The evolution of ( ) ( ∈ ( , , )) can be described by a set of master equations: Here, is the transmission rate of node , is the set of neighbors of node , and is the average infectious duration. Equations (1)(2)(3) are accurate when the network has a tree structure. For real-world networks without too many short loops, previous studies found Equations (1)(2)(3) provide a good approximation [1]. An ensemble of ( ) was initiated at time = 0 by randomly drawing ( 0 ) from a uniform distribution and setting ( 0 ) = 1 − ( 0 ) and ( 0 ) = 0. The transmission rate for each node was drawn from a uniform distribution and was fixed during model simulation.
To estimate ( | = ) ( > ), we use a Bayesian approach: where ( ) is the prior probability and ( = | ) is the likelihood of observing state of individual at time given the state at time . In the sequential inference algorithm, the prior ( ) can be obtained by integrating the posterior estimates of the probabilities of 's states at time − 1 to time . The likelihood ( = | ) can be computed using the master equations in which the state of at time is fixed as , , or . We use the normalization condition to calculate the posterior ( | = ) for = , and . More details are provided in the following section.
Theoretically, it is ideal to estimate the probability of given all future observations, i.e., ( | { } , > ), as the states of nodes in the network may be interdependent. However, this approach requires to run master equations for each observed individual separately to compute the likelihood ({ } , > | ), which can be computationally prohibitive for large networks with a large number of observations. For randomly observed sparse observations, we can assume the interdependency of their states is negligible and estimate the posterior of , To check the validity of the relative independency of observations, we performed an analysis to examine the distance between observed individuals in the contact network. We computed the distribution of the pairwise shortest distance between observed nodes in three real-world networks -Slashdot, Twitter, and Digg ( Fig I). On average, the distance between most pairs of observations is larger than 2 steps. For a small transmission rate, we can assume the observed nodes are relatively independent.

Pseudo-code and implementation
The pseudo-code for the ensemble inference algorithm is provided below. Line 12, model integration, can be performed by running the master equations forward in time.

Backward temporal propagation
Here, we provide more details to perform line 4 in the algorithm (i.e., the backward temporal propagation of information). To estimate ( | = ) ( > ) for each ensemble member, we need to use the prior ( ) and the likelihood ( = | ). We can use the current estimate of ( ) before the update at time as the prior. The major computational task is to estimate the likelihood ( = | ). For instance, to compute ( = | = ) (that is, the probability of observing the state of at time is given that is susceptible at time ), we can set ( = ) = 1 in the master equations at time (keep the probabilities of other nodes as the current estimates) and run the master equations from time to . The likelihood The assumption for the simplified algorithm ENS-I is that the states of observations are relatively independent, so changing the states of other observed nodes won't impact much of the states of the focal nodes. Given the sparse observation and the distance between observed nodes, this assumption is reasonable.
We performed a comparison between the simplified algorithm and the full inference algorithm. For a range of network structures and the percentage of observed individuals, the simplified algorithm yielded similar performance as the full inference algorithm (Figs B, D, E, F, G). Therefore, we presented results obtained from the simplified algorithm in the main text.

Cross-ensemble covariability adjustment
The detail of line 8 (i.e., cross-ensemble covariability adjustment) is provided here. We update the probabilities of observed individuals' neighbors using cross-ensemble covariability.

The modified DMP algorithm
The dynamic message-passing (DMP) algorithm was initially developed to address the question of inferring the origin of an epidemic following the SIR model in networks. Specifically, DMP uses a set of dynamic message-passing equations to describe the evolution of probabilities of individuals in each state (S, I, or R). Define the cavity message → ( ) as the probability that the infection signal has not been passed from node to node up to time when node is fixed to the state , and , → ( ) is the probability that the infection signal has not been passed from node to node up to time when node is fixed to the state and node is in the state at time . The update rules for cavity messages are Here is the transmission probability from node to node , is the recovery probability of node , → ( ) is the probability that node is in the state at time when node is fixed to the state, ( ) is the marginal probability that node is in the state at time , and \ is the set of neighbors of node excluding node .
The above message-passing equations can be iterated in time, starting from initial conditions for cavity messages: where ( ) is the state of node at time , (0) is the initial state of node , and (0), = 1 if and only if (0) = . The marginal probabilities that node is in a given state at time are then given as To infer the origin of an epidemic, we can run the DMP algorithm for each possible seed by setting (0) = 0, → (0) = 1, (0) = 1 for ≠ , and → (0) = 0 for ≠ in the initial condition. The joint probability of the observations is approximated using a mean-field-type approach as a product of the marginal probabilities provided by the dynamic message passing, where is the set of observations, is the observation time, and ( , ) is the marginal probability of the state for node at time estimated using the DMP equations in which node is the epidemic origin. The inferred epidemic origin is selected as the node that maximizes ( | ).

Solving the DMP equations using iterations
The DMP algorithm has a competitive performance in inferring epidemic origin. However, it was not designed specifically for the problem of inferring unobserved infections. We modified the DMP algorithm to estimate the infection probability for each node in this study.
For the observation model used in this study, most observed individuals were only tested once given the low daily testing probability and the relative short simulation period. As a result, it is usually not possible to directly observe the activation or infection times for tested individuals. In addition, as only a small fraction of individuals' states are observed, solving the DMP equations with sparse observation is difficult. Here we propose an iteration approach to approximately solve the DMP equations.
First, we set the initial condition based on sparse observations. For each observed susceptible ( ) individual at time , we set (0) = 1, → ( ) = 1 and → (0) = 0. As the SIR dynamics is irreversible, those individuals remain to be susceptible until the observation time . To incorporate this information, during the iteration from time = 1 to = , we forced ( ) = 1, → ( ) = 1 and → ( ) = 0 for = 1 to . For observed individuals with the states and , their initial states are uncertain. As a result, we assume they are all possibly infected at the beginning of the epidemic. For each observed infected ( ) patient at time , we set (0) = 1 − 1/ , → (0) = 1 and → (0) = 1/ as the initial condition, where is the total number of observed individuals with the state or . During the iteration, we forced ( ) = 0, → ( ) = 0 for = to and → ( ) = 1 (as node is at the state at time ). For each observed recovered ( ) patient at time , we set (0) = 1 − 1/ , → (0) = 1 and → (0) = 1/ as the initial condition. During the iteration, we forced ( ) = 0, → ( ) = 0 and → ( ) = 0 for = to .
Second, to represent the uncertainty in the transmission probability , we used two versions of the modified DMP algorithm. In the first, denoted as DMP1, transmission probabilities were set as a constant, the mean of a uniform distribution; in the second, denoted as DMP2, transmission probabilities for all individuals were randomly drawn from a uniform distribution, same as in the ensemble inference algorithm.
Using the modified DMP algorithm, we integrated the dynamic message passing equations from time = 1 to . The infection probability of individual at time was estimated using ( ).

Discussion on the performance of the modified DMP algorithm
Previous studies showed that the DMP algorithm can support accurate inference of epidemic origin and epidemic parameters [3,4]. However, in this work, the performance of the modified DMP algorithm was not satisfactory for many networks, even for trees where the DMP equations are exact. To confirm that we have implemented the DMP algorithm properly, we reproduced the analysis in Ref. [3] (i.e., infer epidemic origin) in ER networks and found that the algorithm can identify the epidemic origin accurately.
There are two potential reasons for the degraded performance of the modified DMP algorithm in this study. First, the observation model used in this study is not compatible with the one used in the DMP studies. In Ref. [3], the activation times of nodes were observed. The DMP equations were parameterized using information of activation times to capture the local dependency of activation times among neighbors. In this study, we observed the states of a small fraction of nodes at different times, which provide limited information on activation times. Second, solving the DMP equations given a small number of inputs is challenging. In our implementation, we iterated the DMP equations over time while fixing cavity messages for observed nodes. We found that after a few iterations, the cavity messages in the DMP equations tend to converge to the leading eigenvector of the system. A better approximation method may be used to improve the performance of the modified DMP algorithm.

Statistics of real-world networks
Basic statistics of the real-world networks used in experiments are provided in Table B.        Fig F. The modified DMP algorithm with more accurate transmission rates. On an ER random network with = 1598 nodes and an average degree of 2.4, we generated a group of transmission rates according to a bimodal distribution and simulated an outbreak for one week. About 15%, 30% and 45% of all nodes were observed. We used accurate transmission rates for each node in the DMP method in (A-C) and transmission rates with a 10% random noise in (D-E). For the modified DMP algorithm, more accurate transmission rates do not significantly improve the accuracy of inference.    Fig H. Additional experiments in networks with varying observation rates. We show results for networks with more than 5,000 nodes. We compare five different methods: the expedited ensemble inference (ENS-I), modified dynamic message-passing with a fixed transmission rate (DMP1), modified dynamic message-passing with uniformly distributed transmission rates (DMP2), number of connections (Degree), and contact with observed infections (Contact). The original full version of the ensemble inference is computationally expensive for large-scale networks. As a result, we did not run the original algorithm for these networks. We performed the analysis for three real-world networks (Slashdot, Digg, and Twitter). For each network, around 15% nodes were observed. Table A. Configurations of epidemic models and inference. The first column shows the networks used in experiments. Transmission rate distribution is the bimodal distribution of in outbreak simulations, produced by superimposition of two Gaussian distributions ( ( , 2 ) is a Gaussian distribution with the mean and the variance 2 ). Prior transmission rate is the uniform distribution used to sample the individual transmission rates in the ensemble inference. Daily observation rate is the probability of each node in the state S, I or R to be observed on each day. Total observation shows the percentage of observed individuals in the network. The last column shows the corresponding figure in the main text.